CUSTOMER REVIEWS SENTIMENT ANALYSIS

1. Problem Statement

The goal of this project is to learn how to perform sentiment analysis in Python. We will use Natural Language Processing (NLP) techniques to identify negative customer reviews in a dataset of hotel reviews.

The dataset includes customer ratings and textual feedback about their hotel experiences. Sentiment analysis is the part of NLP that extracts emotions from raw text. It is commonly used on social media posts and customer reviews to automatically determine whether users are positive or negative, and why. Using only the raw text of each review, we will attempt to predict whether the review is bad.

2. Data Loading

The first step is to load the data. I have uploaded the dataset to my Google Drive.

2.1 Importing Libraries

2.2 Reading data

2.3 Data Overview

3. Data Pre-Processing

3.1 Data Cleaning

We clean the data by calling our function clean_data, which performs a few operations.
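The exact operations are not listed here, so the following is only a minimal sketch of what a function like clean_data might do. The individual steps (lowercasing, removing punctuation, dropping tokens with digits, stop words, and single characters) and the tiny STOP_WORDS set are assumptions for illustration, not the steps used in this project.

```python
import string

# Hypothetical stop-word list for illustration only; a real project would
# likely use a fuller list (for example, one shipped with an NLP library).
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "were", "in", "of", "to", "it"}


def clean_data(text):
    """Illustrative cleaning steps; the steps used in the project may differ."""
    # 1. Lowercase the text.
    text = text.lower()
    # 2. Replace punctuation with spaces so neighbouring words stay separated.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # 3. Split into tokens on whitespace.
    tokens = text.split()
    # 4. Drop tokens that contain digits, stop words, and single characters.
    tokens = [
        t for t in tokens
        if not any(ch.isdigit() for ch in t)
        and t not in STOP_WORDS
        and len(t) > 1
    ]
    return " ".join(tokens)


cleaned = clean_data("The room was GREAT, and the staff were friendly!!! Stayed 3 nights.")
print(cleaned)  # room great staff friendly stayed nights
```

The function returns a single cleaned string per review, so it can be applied to a whole text column one row at a time.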

3.2 Feature engineering

Since sentiment features are the most important ones for this task, we start by adding them.

For each of the texts, Vader returns 4 values: a positivity score, a neutrality score, a negativity score, and an overall score that summarizes the previous scores.
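As a rough sketch of how these four values could become feature columns: NLTK's SentimentIntensityAnalyzer.polarity_scores returns a dictionary with the keys neg, neu, pos and compound. The helper add_sentiment_features, the "review" column name, and the stand-in scorer with made-up values are all hypothetical; they are used here only so that the example runs without NLTK installed.

```python
def add_sentiment_features(rows, scorer):
    """Attach Vader-style scores to each row as separate feature columns.

    rows   : list of dicts, each with a "review" key (assumed column name)
    scorer : callable mapping text -> dict with keys neg, neu, pos, compound
    """
    enriched = []
    for row in rows:
        scores = scorer(row["review"])
        new_row = dict(row)  # copy so the input rows are not modified
        for key in ("neg", "neu", "pos", "compound"):
            new_row[key] = scores[key]
        enriched.append(new_row)
    return enriched


# Stand-in scorer with made-up values so the sketch runs without NLTK.
# With NLTK installed and its vader_lexicon resource downloaded, one could
# pass SentimentIntensityAnalyzer().polarity_scores here instead.
def fake_scorer(text):
    return {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6}


rows = add_sentiment_features([{"review": "lovely staff"}], fake_scorer)
print(sorted(rows[0]))  # ['compound', 'neg', 'neu', 'pos', 'review']
```

Passing the scorer in as an argument keeps the feature step independent of any one sentiment library.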

We first train a doc2vec model on our text data. This gives us a representation vector for each review, which we add as vector columns.

Every word and document has a TF-IDF (Term Frequency - Inverse Document Frequency) value.
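To make the idea concrete, here is a minimal sketch of the textbook TF-IDF formula, tf(word, doc) * log(N / df(word)), on three made-up tokenized reviews. The function name tf_idf and the sample documents are assumptions for illustration; library implementations often use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # Term frequency in this document times inverse document frequency.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Made-up, already-cleaned reviews.
docs = [
    "room clean staff friendly".split(),
    "room dirty staff rude".split(),
    "breakfast great room".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "room" appears in every review, so its weight is 0, while "clean" appears in only one review and gets the highest weight in the first document.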

3. Exploratory data analysis

We now perform exploratory data analysis (EDA) on our data.

3.1 About the Dataset

This dataset contains 515,738 customer reviews and scores of 1493 luxury hotels across Europe. The CSV file contains 17 fields. The description of each field is as below:

3.2 Distribution of is_review_bad column

Our dataset is highly imbalanced (only 5% of the reviews are negative). This is important to keep in mind when modelling.
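A small worked example shows why the imbalance matters. With 5% negative reviews, a trivial model that always predicts "not bad" reaches 95% accuracy without finding a single bad review. The labels below are made up to match that 5% ratio.

```python
# Hypothetical labels: 1 = bad review, 0 = not bad, with 5% bad reviews.
labels = [1] * 5 + [0] * 95

# A trivial "model" that always predicts the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
bad_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(f"accuracy: {accuracy:.2f}")        # accuracy: 0.95
print(f"bad reviews found: {bad_found}")  # bad reviews found: 0
```

This is why accuracy alone is a weak yardstick for the is_review_bad target.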

3.3 Distribution of Average Score by Hotel Name

We see that most hotels have an average_score in the range of 8.0 to 9.1.

3.4 Count of reviews per nationality

3.5 Hotel Location on World Map

3.6 Review Sentiment Distribution

3.7 Wordcloud

The majority of the words above are hotel-related: room, breakfast, staff, and so on.

The remaining few are associated with the customer experience: pricey, appreciated, loved, and so on.

3.8 Highest positive sentiment reviews

The reviews with the highest positive sentiment scores do correspond to pleasant feedback.

3.9 Lowest negative sentiment reviews

3.10 Sentiment Distribution for positive and negative reviews

The graph above shows the distribution of sentiment scores for positive and negative reviews. We can see that Vader assigns high sentiment scores to most favorable reviews, whereas negative reviews tend to receive lower sentiment scores.

4. Modelling

We now model the dependent variable (is_review_bad).

4.1 Train-Test Split

For model training, we choose which features to use. The data is then divided into two portions: a training set and a test set.
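The split itself can be sketched in a few lines. The version below is stratified, meaning it keeps the share of bad reviews the same in both portions, which is one reasonable choice for imbalanced data. The function name stratified_split, the 20% test fraction, the seed, and the made-up labels are all assumptions; the split used in this project may differ.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), keeping class proportions."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [1] * 50 + [0] * 950  # hypothetical: 5% bad reviews
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))     # 800 200
print(sum(labels[i] for i in test_idx))  # 10 bad reviews in the test set
```

In practice a library helper such as scikit-learn's train_test_split (with its stratify argument) does the same job; the hand-written version only makes the mechanics visible.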

We implement a Random Forest classifier for our predictions.

4.2 Random Forest

4.3 Feature Importance

4.4 ROC Curve

The ROC curve is usually a good way to summarize the quality of our classifier. The higher the curve is above the diagonal baseline, the better the predictions. However, the ROC AUC should not be the only criterion used to assess the quality of the model.

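In the ROC curve, the false positive rate is FP / (FP + TN). Because our dataset is imbalanced, the number of true negatives is very large, so our model can make a large number of false positive predictions while still maintaining a low false positive rate. It can thereby increase the true positive rate and artificially improve the ROC AUC measure.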
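The effect is easy to see with numbers. The confusion matrix below is made up (10,000 reviews, 5% of them bad) and is not a result from our models. It only illustrates how a point that looks good on the ROC curve can hide poor precision.

```python
def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate, precision)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# Hypothetical confusion matrix: 10,000 reviews, 5% of them bad.
tp, fn = 400, 100   # 500 bad reviews, 400 of them caught
fp, tn = 950, 8550  # 9,500 other reviews, 950 flagged by mistake

tpr, fpr, precision = rates(tp, fp, tn, fn)
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  precision={precision:.2f}")
# TPR=0.80  FPR=0.10  precision=0.30
```

A true positive rate of 0.80 at a false positive rate of 0.10 sits well above the diagonal, yet only about 30% of the reviews flagged as bad really are bad. Precision-based measures are therefore a useful complement to the ROC AUC on this kind of data.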

4.5 Logistic Regression

4.6 XGBoost

4.7 Comparing Model Accuracy

Recommendation engine

5. Conclusion

It is feasible to make predictions using only raw text as input. Extracting meaningful features from this raw data is the most critical step. Raw text is frequently a useful supplement in data projects, allowing us to extract more learning features and improve the predictive power of our models.

6. References

https://towardsdatascience.com/a-complete-sentiment-analysis-project-using-pythons-scikit-learn-b9ccbb0405c2

https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe